Simple Linear Regression

From Theory to Practice in R

Econometrics - Nova SBE

Roadmap

In the previous lectures we covered the algebra of Simple Linear Regression on the whiteboard.

Today we bring it to life with data, code, and simulations.

  1. Kaggle
  2. Quick R warm-up
  3. Exploring a real dataset
  4. The OLS estimator — by hand
  5. Using lm() — R does it for us
  6. Inference — by hand vs. summary()
  7. Simulation: \(\hat{\beta}_1\) is a random variable
  8. What affects precision?
  9. Residual diagnostics
  10. Confidence and prediction bands

This lecture

library(qrcode)

plot(qr_code('https://pfagandini.github.io/novasbe_econometrics/R_SLR_Lecture_Slides.html'))

Part 0: Kaggle

Kaggle

Go to this website and log in however you like; I log in with my Google account.

https://www.kaggle.com

Kaggle

Click on “Create”

And select “Notebook”

Kaggle

Sometimes it defaults to Python, but it learns your preference: use R often and it will start creating R notebooks by default.

To check/set an R environment:

  1. Go to the File menu
  2. Select Language
  3. Choose R
  4. Click the Start Session icon (a circle with a vertical line, top right corner of the notebook)

Part I: R Basics

Vectors: the building block of R

# A vector is the most basic data structure in R
x <- c(1, 2, 3, 4, 5)
x
[1] 1 2 3 4 5
# R is vectorized: operations apply element by element
x^2
[1]  1  4  9 16 25
# Summary statistics
cat("Mean:", mean(x), " | Variance:", var(x), " | Std Dev:", sd(x), "\n")
Mean: 3  | Variance: 2.5  | Std Dev: 1.581139 

Note

var() uses \(n-1\) (sample variance), not \(n\).
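A quick way to check this yourself, using the same vector `x` as above:

```r
# var() divides by n - 1, not n: verify against a manual computation
x <- c(1, 2, 3, 4, 5)
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)
all.equal(manual_var, var(x))   # TRUE — both equal 2.5
```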

Quick plotting

y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

plot(x, y, pch = 19, col = "steelblue",
     main = "My First Scatter Plot", xlab = "x", ylab = "y")
grid()

Part II: A Real Dataset

The mtcars dataset

Built into R: data on 32 automobiles (1974 Motor Trend magazine).

Our question: How does the weight of a car affect its fuel consumption?

\[\text{mpg}_i = \beta_0 + \beta_1 \cdot \text{wt}_i + u_i\]

data(mtcars)
head(mtcars[, c("mpg", "wt", "hp", "cyl")], 8)
                   mpg    wt  hp cyl
Mazda RX4         21.0 2.620 110   6
Mazda RX4 Wag     21.0 2.875 110   6
Datsun 710        22.8 2.320  93   4
Hornet 4 Drive    21.4 3.215 110   6
Hornet Sportabout 18.7 3.440 175   8
Valiant           18.1 3.460 105   6
Duster 360        14.3 3.570 245   8
Merc 240D         24.4 3.190  62   4

Descriptive statistics

y <- mtcars$mpg   # miles per gallon (dependent variable)
x <- mtcars$wt    # weight in 1000 lbs (independent variable)
summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.513   2.581   3.325   3.217   3.610   5.424 

Visual exploration

par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))

hist(y, breaks = 10, col = "steelblue", border = "white",
     main = "Distribution of mpg", xlab = "Miles per Gallon")

hist(x, breaks = 10, col = "coral", border = "white",
     main = "Distribution of Weight", xlab = "Weight (1000 lbs)")

plot(x, y, pch = 19, col = "steelblue",
     main = "mpg vs Weight", xlab = "Weight (1000 lbs)", ylab = "mpg")
grid()

Correlation

cor(x, y)
[1] -0.8676594

Strong negative linear relationship. Heavier cars tend to have lower fuel efficiency.

Question: Looking at the scatter plot, does a straight line seem reasonable? What sign should the slope have?

Part III: OLS by Hand

The formulas (from the whiteboard)

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

Let’s compute this step by step.

Step 1: Means

n <- length(x)
x_bar <- mean(x)
y_bar <- mean(y)
cat("n =", n, "\n")
n = 32 
cat("x̄ =", x_bar, "\n")
x̄ = 3.21725 
cat("ȳ =", y_bar, "\n")
ȳ = 20.09062 

Step 2: Deviations from the mean

dx <- x - x_bar
dy <- y - y_bar

# First 6 observations
data.frame(
  car = rownames(mtcars)[1:6],
  x = x[1:6], y = y[1:6],
  dx = round(dx[1:6], 3), dy = round(dy[1:6], 3),
  dx_times_dy = round(dx[1:6] * dy[1:6], 3)
)
                car     x    y     dx     dy dx_times_dy
1         Mazda RX4 2.620 21.0 -0.597  0.909      -0.543
2     Mazda RX4 Wag 2.875 21.0 -0.342  0.909      -0.311
3        Datsun 710 2.320 22.8 -0.897  2.709      -2.431
4    Hornet 4 Drive 3.215 21.4 -0.002  1.309      -0.003
5 Hornet Sportabout 3.440 18.7  0.223 -1.391      -0.310
6           Valiant 3.460 18.1  0.243 -1.991      -0.483

Step 3: \(S_{xy}\) and \(S_{xx}\)

Sxy <- sum(dx * dy)
Sxx <- sum(dx^2)
cat("Sxy =", Sxy, "\n")
Sxy = -158.6172 
cat("Sxx =", Sxx, "\n")
Sxx = 29.67875 

Step 4: The estimates

beta1_hat <- Sxy / Sxx
beta0_hat <- y_bar - beta1_hat * x_bar

cat("β̂₀ =", round(beta0_hat, 4), "\n")
β̂₀ = 37.2851 
cat("β̂₁ =", round(beta1_hat, 4), "\n")
β̂₁ = -5.3445 
cat("\nFor each additional 1000 lbs of weight,\n")

For each additional 1000 lbs of weight,
cat("mpg decreases by", round(abs(beta1_hat), 2), "miles per gallon\n")
mpg decreases by 5.34 miles per gallon
cat("on average, ceteris paribus.")
on average, ceteris paribus.

The fitted line

Code
plot(x, y, pch = 19, col = "steelblue", cex = 1.3,
     main = "OLS Regression Line (computed by hand)",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")

abline(a = beta0_hat, b = beta1_hat, col = "red", lwd = 2.5)

points(x_bar, y_bar, pch = 4, col = "red", cex = 2.5, lwd = 3)

text(x_bar + 0.15, y_bar + 1.2,
     expression(paste("(", bar(x), ", ", bar(y), ")")), col = "red", cex = 1.1)

grid()

legend("topright",
       legend = c(paste0("ŷ = ", round(beta0_hat, 2), " + (",
                         round(beta1_hat, 2), ") × wt"), "Point of means"),
       col = c("red", "red"), lty = c(1, NA), pch = c(NA, 4),
       lwd = c(2.5, 3), bty = "n")

Important property

The OLS regression line always passes through the point of means \((\bar{x}, \bar{y})\).

Why? From the formula:

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

Substitute \(x = \bar{x}\):

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x} = (\bar{y} - \hat{\beta}_1 \bar{x}) + \hat{\beta}_1 \bar{x} = \bar{y}\]
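The same check, done numerically (a self-contained sketch that recomputes the mtcars estimates from scratch):

```r
# Plugging x̄ into the fitted line should return exactly ȳ
x <- mtcars$wt;  y <- mtcars$mpg
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
all.equal(b0 + b1 * mean(x), mean(y))   # TRUE
```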

Visualizing the residuals

Code
y_hat <- beta0_hat + beta1_hat * x
resid <- y - y_hat

plot(x, y, pch = 19, col = "steelblue", cex = 1.3,
     main = "Residuals: the vertical distances",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
abline(a = beta0_hat, b = beta1_hat, col = "red", lwd = 2)
segments(x0 = x, y0 = y, x1 = x, y1 = y_hat, col = "gray40", lty = 2)
points(x, y_hat, pch = 1, col = "red", cex = 1)
grid()
legend("topright",
       legend = c("Observed (y)", "Fitted (ŷ)", "Residual (y − ŷ)"),
       col = c("steelblue", "red", "gray40"),
       pch = c(19, 1, NA), lty = c(NA, NA, 2), bty = "n")

Properties of residuals

cat("Sum of residuals:", round(sum(resid), 10), "\n")
Sum of residuals: 0 
cat("Cor(x, residuals):", round(cor(x, resid), 10), "\n")
Cor(x, residuals): 0 
cat("Mean of fitted values:", round(mean(y_hat), 4), "\n")
Mean of fitted values: 20.0906 
cat("Mean of y:            ", round(mean(y), 4), "\n")
Mean of y:             20.0906 
cat("  → They are equal!\n")
  → They are equal!

Part IV: lm() — R Does It for Us

Fitting and verifying

model <- lm(mpg ~ wt, data = mtcars)
coef(model)
(Intercept)          wt 
  37.285126   -5.344472 
cat("By hand:  β̂₀ =", round(beta0_hat, 4), "  β̂₁ =", round(beta1_hat, 4), "\n")
By hand:  β̂₀ = 37.2851   β̂₁ = -5.3445 
cat("lm():     β̂₀ =", round(coef(model)[1], 4),
    "  β̂₁ =", round(coef(model)[2], 4), "\n")
lm():     β̂₀ = 37.2851   β̂₁ = -5.3445 
cat("\n✓ They match!")

✓ They match!

Part V: Inference by Hand

Step 1: Estimate \(\sigma^2\)

\[\hat{\sigma}^2 = \frac{\text{SSR}}{n - 2} = \frac{\sum_{i=1}^{n} \hat{u}_i^2}{n - 2}\]

SSR <- sum(resid^2)
df <- n - 2
sigma2_hat <- SSR / df
sigma_hat <- sqrt(sigma2_hat)

cat("SSR =", round(SSR, 4), "\n")
SSR = 278.3219 
cat("df  = n - 2 =", df, "\n")
df  = n - 2 = 30 
cat("σ̂²  =", round(sigma2_hat, 4), "\n")
σ̂²  = 9.2774 
cat("σ̂   =", round(sigma_hat, 4), " (Residual Standard Error)\n")
σ̂   = 3.0459  (Residual Standard Error)

Step 2: Standard errors

\[\text{se}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{S_{xx}}}\]

se_beta1 <- sigma_hat / sqrt(Sxx)

cat("se(β̂₁) =", round(se_beta1, 4), "\n")
se(β̂₁) = 0.5591 

Step 3: t-statistics and p-values

Under \(H_0: \beta_j = 0\): \(\quad t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)} \sim t_{n-2}\)

t_beta1 <- beta1_hat / se_beta1
p_beta1 <- 2 * pt(abs(t_beta1), df = df, lower.tail = FALSE)

cat(sprintf("         Estimate   Std.Error   t-value    p-value\n"))
         Estimate   Std.Error   t-value    p-value
cat(sprintf("β̂₁  %10.4f  %10.4f  %8.3f   %.2e\n",
            beta1_hat, se_beta1, t_beta1, p_beta1))
β̂₁     -5.3445      0.5591    -9.559   1.29e-10

Visualizing the t-test for \(\beta_1\)

Code
alpha <- 0.05
t_crit <- qt(1 - alpha/2, df = df)

curve(dt(x, df = df), from = -15, to = 15, n = 500, lwd = 2, col = "steelblue",
      main = expression(paste("t-test for ", H[0], ": ", beta[1], " = 0")),
      xlab = "t", ylab = "Density")

x_left <- seq(-15, -t_crit, length.out = 200)
x_right <- seq(t_crit, 15, length.out = 200)
polygon(c(x_left, rev(x_left)), c(dt(x_left, df), rep(0, 200)),
        col = rgb(1, 0, 0, 0.3), border = NA)
polygon(c(x_right, rev(x_right)), c(dt(x_right, df), rep(0, 200)),
        col = rgb(1, 0, 0, 0.3), border = NA)

abline(v = t_beta1, col = "red", lwd = 2.5, lty = 2)
text(t_beta1 + 1.5, 0.15, paste0("t = ", round(t_beta1, 2)),
     col = "red", cex = 1.2, font = 2)
abline(v = c(-t_crit, t_crit), col = "gray40", lty = 3)
legend("topleft",
       legend = c("t(30) distribution", "Rejection region (α = 0.05)", "Our t-statistic"),
       col = c("steelblue", rgb(1,0,0,0.3), "red"),
       lwd = c(2, 8, 2.5), lty = c(1, 1, 2), bty = "n")

Confidence intervals (by hand)

\[\hat{\beta}_j \pm t_{\alpha/2, \, n-2} \cdot \text{se}(\hat{\beta}_j)\]

cat("Critical value t_{0.025, 30} =", round(t_crit, 4), "\n\n")
Critical value t_{0.025, 30} = 2.0423 
ci_beta1 <- beta1_hat + c(-1, 1) * t_crit * se_beta1

cat("95% CI for β₁: [", round(ci_beta1[1], 4), ",", round(ci_beta1[2], 4), "]\n")
95% CI for β₁: [ -6.4863 , -4.2026 ]
cat("\nNote: 0 is NOT inside the CI for β₁ → we reject H₀: β₁ = 0")

Note: 0 is NOT inside the CI for β₁ → we reject H₀: β₁ = 0

Comparing with summary()

summary(model)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Confidence intervals: automatic

confint(model)
                2.5 %    97.5 %
(Intercept) 33.450500 41.119753
wt          -6.486308 -4.202635

Everything matches our hand calculations.
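If you want to verify the match programmatically rather than by eye, one way is to rebuild the CI from the formula and compare it to `confint()`:

```r
# By-hand CI β̂₁ ± t_{0.025, n-2} · se(β̂₁) vs. confint()
model <- lm(mpg ~ wt, data = mtcars)
t_crit <- qt(0.975, df = df.residual(model))
se_b1 <- coef(summary(model))["wt", "Std. Error"]
by_hand <- coef(model)["wt"] + c(-1, 1) * t_crit * se_b1
all.equal(unname(by_hand), unname(confint(model)["wt", ]))   # TRUE
```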

\(R^2\) decomposition

SST <- sum((y - y_bar)^2)
SSE <- sum((y_hat - y_bar)^2)
R2 <- 1 - SSR / SST
cat("R² =", round(R2, 4), "\n")
R² = 0.7528 
cat("Weight explains", round(R2 * 100, 1), "% of the variation in mpg.\n")
Weight explains 75.3 % of the variation in mpg.
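A sanity check on the decomposition itself: the three sums of squares must add up, and \(R^2\) can be computed either as \(\text{SSE}/\text{SST}\) or as \(1 - \text{SSR}/\text{SST}\).

```r
# Verify SST = SSE + SSR and that both R² formulas agree with summary()
model <- lm(mpg ~ wt, data = mtcars)
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
SSE <- sum((fitted(model) - mean(mtcars$mpg))^2)
SSR <- sum(residuals(model)^2)
all.equal(SST, SSE + SSR)                        # TRUE
all.equal(SSE / SST, summary(model)$r.squared)   # TRUE
```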

\(R^2\) decomposition

Code
par(mar = c(4, 4, 3, 1))
barplot(c(SST = SST, SSE = SSE, SSR = SSR),
        col = c("gray70", "steelblue", "coral"),
        main = "Decomposition: SST = SSE + SSR", ylab = "Sum of Squares", border = NA)
text(0.7, SST/2, paste0("SST\n", round(SST, 1)), cex = 1, font = 2)
text(1.9, SSE/2, paste0("SSE\n", round(SSE, 1), "\n(", round(SSE/SST*100,1), "%)"),
     cex = 1, font = 2)
text(3.1, SSR/2, paste0("SSR\n", round(SSR, 1), "\n(", round(SSR/SST*100,1), "%)"),
     cex = 1, font = 2)

Part VI: ⭐ \(\hat{\beta}_1\) Is a Random Variable

The key insight

We said \(\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\) depends on the sample.

If we took a different sample from the same population, we would get a different \(\hat{\beta}_1\).

Let’s simulate this!

  • True model: \(y = 5 + 3x + u\), where \(u \sim N(0, 16)\)
  • Draw 1000 samples of size \(n = 50\)
  • Estimate \(\hat{\beta}_1\) each time

Running the simulation

beta0_true <- 5;  beta1_true <- 3;  sigma_true <- 4
set.seed(42)

n_sim <- 1000;  n_obs <- 50
x_sim <- runif(n_obs, min = 0, max = 10)

beta1_estimates <- numeric(n_sim)
beta0_estimates <- numeric(n_sim)

for (i in 1:n_sim) {
  u <- rnorm(n_obs, mean = 0, sd = sigma_true)
  y_sim <- beta0_true + beta1_true * x_sim + u
  fit <- lm(y_sim ~ x_sim)
  beta0_estimates[i] <- coef(fit)[1]
  beta1_estimates[i] <- coef(fit)[2]
}

cat("Done:", n_sim, "regressions, each with n =", n_obs, "observations.\n")
Done: 1000 regressions, each with n = 50 observations.

20 samples → 20 different lines

Code
plot(NULL, xlim = c(0, 10), ylim = c(-5, 45),
     main = "20 Different Samples → 20 Different Regression Lines",
     xlab = "x", ylab = "y")
for (i in 1:20) {
  abline(a = beta0_estimates[i], b = beta1_estimates[i],
         col = rgb(0.3, 0.3, 0.8, 0.3), lwd = 1.5)
}
abline(a = beta0_true, b = beta1_true, col = "red", lwd = 3)
legend("topleft",
       legend = c("True line: y = 5 + 3x", "Estimated lines (one per sample)"),
       col = c("red", rgb(0.3, 0.3, 0.8, 0.5)), lwd = c(3, 2), bty = "n")
grid()

The sampling distribution of \(\hat{\beta}_1\)

Code
se_theory <- sigma_true / sqrt(sum((x_sim - mean(x_sim))^2))

hist(beta1_estimates, breaks = 40, freq = FALSE,
     col = "steelblue", border = "white",
     main = expression(paste("Sampling Distribution of ", hat(beta)[1],
                             " (1000 simulations)")),
     xlab = expression(hat(beta)[1]),
     xlim = c(beta1_true - 1.5, beta1_true + 1.5))
abline(v = beta1_true, col = "red", lwd = 3)
abline(v = mean(beta1_estimates), col = "orange", lwd = 2, lty = 2)
curve(dnorm(x, mean = beta1_true, sd = se_theory), add = TRUE,
      col = "red", lwd = 2, lty = 2)
legend("topright",
       legend = c(paste0("True β₁ = ", beta1_true),
                  paste0("Mean(β̂₁) = ", round(mean(beta1_estimates), 4)),
                  "Theoretical N(β₁, σ²/Sxx)"),
       col = c("red", "orange", "red"),
       lwd = c(3, 2, 2), lty = c(1, 2, 2), bty = "n")

Unbiasedness confirmed

cat("True β₁              =", beta1_true, "\n")
True β₁              = 3 
cat("Mean of β̂₁           =", round(mean(beta1_estimates), 4), "\n")
Mean of β̂₁           = 2.9945 
cat("  → UNBIASED: E[β̂₁] ≈ β₁ ✓\n\n")
  → UNBIASED: E[β̂₁] ≈ β₁ ✓
cat("Std Dev of β̂₁ (simulated) =", round(sd(beta1_estimates), 4), "\n")
Std Dev of β̂₁ (simulated) = 0.1932 
cat("Std Dev of β̂₁ (theory)    =", round(se_theory, 4), "\n")
Std Dev of β̂₁ (theory)    = 0.1882 
cat("  → MATCHES: se(β̂₁) = σ/√Sxx ✓\n\n")
  → MATCHES: se(β̂₁) = σ/√Sxx ✓
cat("True β₀              =", beta0_true, "\n")
True β₀              = 5 
cat("Mean of β̂₀           =", round(mean(beta0_estimates), 4), "\n")
Mean of β̂₀           = 5.022 
cat("  → UNBIASED: E[β̂₀] ≈ β₀ ✓\n")
  → UNBIASED: E[β̂₀] ≈ β₀ ✓

Key takeaways from the simulation

  • Each sample gives a different \(\hat{\beta}_1\) — the estimator is a random variable
  • On average, \(\hat{\beta}_1\) equals the true \(\beta_1\) → unbiasedness
  • The spread of \(\hat{\beta}_1\) is captured by the standard error
  • The histogram matches the theoretical normal → this justifies our t-tests and CIs
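One more consequence worth checking: if the sampling distribution matches theory, a 95% CI should contain the true \(\beta_1\) in roughly 95% of samples. A small sketch using the same data-generating process (the exact coverage rate depends on the seed):

```r
# Coverage check: how often does the 95% CI contain the true β₁ = 3?
set.seed(7)
beta0_true <- 5;  beta1_true <- 3;  sigma_true <- 4
n_obs <- 50;  n_sim <- 1000
x_sim <- runif(n_obs, 0, 10)
covered <- logical(n_sim)
for (i in 1:n_sim) {
  y_sim <- beta0_true + beta1_true * x_sim + rnorm(n_obs, 0, sigma_true)
  ci <- confint(lm(y_sim ~ x_sim))["x_sim", ]
  covered[i] <- ci[1] <= beta1_true && beta1_true <= ci[2]
}
mean(covered)   # should be close to 0.95
```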

Part VII: What Affects Precision?

The formula tells us everything

\[\text{se}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{S_{xx}}}\]

Three things affect precision:

  1. Sample size \(n\) → more data → larger \(S_{xx}\) → smaller se
  2. Spread of \(x\) → more variation → larger \(S_{xx}\) → smaller se
  3. Error variance \(\sigma^2\) → noisier data → larger se

Effect of sample size

Code
set.seed(123)
sample_sizes <- c(20, 50, 100, 500)
n_reps <- 1000

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
for (nn in sample_sizes) {
  x_temp <- runif(nn, 0, 10)
  b1_vec <- numeric(n_reps)
  for (j in 1:n_reps) {
    u_temp <- rnorm(nn, 0, sigma_true)
    y_temp <- beta0_true + beta1_true * x_temp + u_temp
    b1_vec[j] <- coef(lm(y_temp ~ x_temp))[2]
  }
  hist(b1_vec, breaks = 35, freq = FALSE, col = "steelblue", border = "white",
       main = paste0("n = ", nn, "  (sd = ", round(sd(b1_vec), 3), ")"),
       xlab = expression(hat(beta)[1]), xlim = c(1.5, 4.5))
  abline(v = beta1_true, col = "red", lwd = 2)
}

Effect of noise level \(\sigma\)

Code
set.seed(456)
sigmas <- c(1, 4, 8, 16)
nn_fixed <- 50;  x_temp <- runif(nn_fixed, 0, 10)

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
for (ss in sigmas) {
  b1_vec <- numeric(n_reps)
  for (j in 1:n_reps) {
    u_temp <- rnorm(nn_fixed, 0, ss)
    y_temp <- beta0_true + beta1_true * x_temp + u_temp
    b1_vec[j] <- coef(lm(y_temp ~ x_temp))[2]
  }
  hist(b1_vec, breaks = 35, freq = FALSE, col = "coral", border = "white",
       main = bquote(sigma == .(ss) ~ " (sd = " * .(round(sd(b1_vec), 3)) * ")"),
       xlab = expression(hat(beta)[1]), xlim = c(-1, 7))
  abline(v = beta1_true, col = "red", lwd = 2)
}

Observation

More data → tighter distribution.

More noise → wider distribution.

In both cases, the estimator remains centered on the true value (unbiased).

Part VIII: Subsampling from Real Data

Different subsamples, different estimates

Code
set.seed(2024)
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "gray70", cex = 1.2,
     main = "8 Random Subsamples (n = 20) from mtcars",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")

colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
            "#FF7F00", "#A65628", "#F781BF", "#999999")
sub_betas <- numeric(8)
for (i in 1:8) {
  idx <- sample(1:32, size = 20, replace = FALSE)
  fit_sub <- lm(mpg ~ wt, data = mtcars[idx, ])
  abline(fit_sub, col = colors[i], lwd = 2)
  sub_betas[i] <- coef(fit_sub)[2]
}
abline(model, col = "black", lwd = 3, lty = 2)
legend("topright",
       legend = c("Full sample (n=32)", "Subsample lines (n=20)"),
       col = c("black", colors[2]), lwd = c(3, 2), lty = c(2, 1), bty = "n")
grid()

Part IX: Residual Diagnostics

Four diagnostic plots

Code
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

plot(fitted(model), residuals(model), pch = 19, col = "steelblue",
     main = "Residuals vs Fitted", xlab = "Fitted Values (ŷ)", ylab = "Residuals (û)")
abline(h = 0, col = "red", lwd = 2, lty = 2); grid()

hist(residuals(model), breaks = 10, freq = FALSE,
     col = "steelblue", border = "white",
     main = "Distribution of Residuals", xlab = "Residuals")
curve(dnorm(x, mean = 0, sd = sigma_hat), add = TRUE, col = "red", lwd = 2)

qqnorm(residuals(model), pch = 19, col = "steelblue", main = "Normal Q-Q Plot")
qqline(residuals(model), col = "red", lwd = 2)

plot(fitted(model), sqrt(abs(rstandard(model))), pch = 19, col = "steelblue",
     main = "Scale-Location", xlab = "Fitted Values",
     ylab = expression(sqrt("Standardized Residuals")))
grid()

How to read them

  • Residuals vs Fitted: Random scatter around 0 = good. Curve or funnel = problems.
  • Histogram / Q-Q: Check approximate normality.
  • Scale-Location: Roughly constant spread = homoscedasticity.
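To see what a “funnel” actually looks like, here is a small simulated example (not from mtcars) where the error standard deviation grows with \(x\):

```r
# Simulated heteroskedasticity: error sd proportional to x → funnel shape
set.seed(99)
x_h <- runif(200, 1, 10)
y_h <- 2 + 1.5 * x_h + rnorm(200, sd = 0.5 * x_h)
fit_h <- lm(y_h ~ x_h)
plot(fitted(fit_h), residuals(fit_h), pch = 19, col = "coral",
     main = "Heteroskedastic errors: the funnel shape",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lwd = 2, lty = 2); grid()
```

The residual spread widens as the fitted values increase, in contrast to the roughly constant band we saw for the mtcars model.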

Part X: Confidence & Prediction Bands

Two types of intervals

Code
x_grid <- data.frame(wt = seq(min(mtcars$wt) - 0.2, max(mtcars$wt) + 0.2,
                               length.out = 200))
pred_ci <- predict(model, newdata = x_grid, interval = "confidence")
pred_pi <- predict(model, newdata = x_grid, interval = "prediction")

plot(mtcars$wt, mtcars$mpg, pch = 19, col = "steelblue", cex = 1.2,
     main = "Confidence and Prediction Intervals",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     ylim = c(min(pred_pi[,"lwr"]) - 1, max(pred_pi[,"upr"]) + 1))
polygon(c(x_grid$wt, rev(x_grid$wt)),
        c(pred_pi[,"lwr"], rev(pred_pi[,"upr"])),
        col = rgb(1, 0.6, 0.4, 0.2), border = NA)
polygon(c(x_grid$wt, rev(x_grid$wt)),
        c(pred_ci[,"lwr"], rev(pred_ci[,"upr"])),
        col = rgb(0.3, 0.5, 0.8, 0.3), border = NA)
lines(x_grid$wt, pred_ci[,"fit"], col = "red", lwd = 2.5)
points(mtcars$wt, mtcars$mpg, pch = 19, col = "steelblue", cex = 1.2)
legend("topright",
       legend = c("OLS line", "95% CI for E[Y|X]", "95% PI for new Y"),
       col = c("red", rgb(0.3,0.5,0.8,0.5), rgb(1,0.6,0.4,0.4)),
       lwd = c(2.5, 8, 8), bty = "n")
grid()

Interpretation

Confidence interval (blue): Where the true regression line likely is. Narrower near \(\bar{x}\).

Prediction interval (orange): Where a new observation would fall. Always wider because it includes both line uncertainty and the noise \(\sigma^2\).
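A concrete comparison at a single point, say a car weighing 3000 lbs (wt = 3):

```r
# CI for the mean mpg at wt = 3 vs. PI for one new car at wt = 3
model <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)
predict(model, newdata = new_car, interval = "confidence")
predict(model, newdata = new_car, interval = "prediction")
# Same point estimate, but the prediction interval is wider: it adds the
# irreducible noise σ² on top of the uncertainty about the line itself.
```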

Summary

Key R functions

Task Code
Fit model lm(y ~ x, data = df)
Coefficients coef(model)
Full output summary(model)
CIs confint(model)
Fitted values fitted(model)
Residuals residuals(model)
Predict predict(model, newdata)
\(R^2\) summary(model)$r.squared
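For example, pulling numbers out of the fitted object for later use (a short sketch with the mtcars model from this lecture):

```r
# Extract the pieces we computed by hand from the summary object
model <- lm(mpg ~ wt, data = mtcars)
s <- summary(model)
s$r.squared                     # R² ≈ 0.7528
s$sigma                         # residual standard error σ̂ ≈ 3.046
coef(s)["wt", "Std. Error"]     # se(β̂₁) ≈ 0.5591
predict(model, newdata = data.frame(wt = c(2.5, 3.5)))
```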

Takeaways

  1. OLS minimizes the sum of squared residuals
  2. \(\hat{\beta}_1\) is a random variable — it changes with the sample
  3. Unbiasedness: \(E[\hat{\beta}_1] = \beta_1\) — on average, OLS gets it right
  4. Standard errors decrease with more data and less noise
  5. t-tests and CIs follow from the sampling distribution
  6. R makes all of this easy — but understanding what’s behind it is essential!